Abstract

This paper studies the subspace segmentation problem. Given a set of
data points drawn from a union of subspaces, the goal is to partition
them according to the underlying subspaces they were drawn from. The
spectral clustering method is used as the framework. It requires
finding an affinity matrix which is close to block diagonal, with
nonzero entries corresponding to data point pairs from the same
subspace. In this work, we argue that both sparsity and the grouping
effect are important for subspace segmentation. A sparse affinity
matrix tends to be block diagonal, with fewer connections between data
points from different subspaces. The grouping effect ensures that
highly correlated data, which are usually from the same subspace, can
be grouped together. Sparse Subspace Clustering (SSC), by using
$\ell_1$-minimization, encourages sparsity for data selection, but it
lacks the grouping effect. On the contrary, Low-Rank Representation
(LRR), by rank minimization, and Least Squares Regression (LSR), by
$\ell_2$-regularization, exhibit a strong grouping effect, but they
fall short in subset selection. Thus the obtained affinity matrix is
usually very sparse by SSC, yet very dense by LRR and LSR.

In this work, we propose the Correlation Adaptive Subspace
Segmentation (CASS) method by using trace Lasso. CASS is a data
correlation dependent method which simultaneously performs automatic
data selection and groups correlated data together. It can be regarded
as a method which adaptively balances SSC and LSR. Both theoretical
and experimental results show the effectiveness of CASS.

1. Introduction 

Figure 1. Example on a subset with 10 subjects of the Extended Yale B
database. For a given data point $y$ and a data set $X$, $y$ can be
approximately expressed as a linear representation of all the columns
of $X$ by different methods. This figure shows the absolute values of
the representation coefficients (normalized to [0, 1] for ease of
display) derived by SSC, LRR, LSR and the proposed CASS. Here different
columns in each subfigure indicate different subjects. The red
coefficients correspond to the face images which are from the same
subject as $y$. One can see that the coefficients derived by SSC are
very sparse, and only a few samples within the cluster are selected to
represent $y$. Both LRR and LSR lead to dense representations. They
group data not only within clusters, but also between clusters. For
CASS, most of the large coefficients concentrate on the data points
within the cluster. Thus it approximately reveals the true
segmentation of the data. Images in this paper are best viewed on
screen!

This paper focuses on subspace segmentation, the goal of which is to
segment a given data set into clusters, ideally with each cluster
corresponding to a subspace. Subspace segmentation is an important
problem in both the computer vision and machine learning literature.
It has numerous applications, such as motion segmentation, face
clustering, and image segmentation, owing to the fact that real-world
data often approximately lie in a mixture of subspaces. The problem is
formally defined as follows:

Definition 1 (Subspace Segmentation) Given a set of sufficiently
sampled data vectors $X = [x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$,
where $d$ is the feature dimension, and $n$ is the number of data
vectors. Assume that the data are drawn from a union of $k$ subspaces
$\{S_i\}_{i=1}^k$ of unknown dimensions $\{r_i\}_{i=1}^k$,
respectively. The task is to segment the data according to the
underlying subspaces they are drawn from.

1.1. Summary of notations 

Some notations are used in this work. We use capital and lowercase
symbols to represent matrices and vectors, respectively. In
particular, $\mathbf{1}_d \in \mathbb{R}^d$ denotes the vector of all
1's, $e_i$ is a vector whose $i$-th entry is 1 and 0 for others, and
$I$ is used to denote the identity matrix. $\mathrm{Diag}(v)$ converts
the vector $v$ into a diagonal matrix in which the $i$-th diagonal
entry is $v_i$. $\mathrm{diag}(A)$ is a vector whose $i$-th entry is
$A_{ii}$ of a square matrix $A$. $\mathrm{tr}(A)$ is the trace of a
square matrix $A$. $A_i$ denotes the $i$-th column of a matrix $A$.
$\mathrm{sign}(x)$ is the sign function defined as
$\mathrm{sign}(x) = x/|x|$ if $x \neq 0$ and 0 otherwise.
$v \to v_0$ denotes that $v$ converges to $v_0$.

Some vector and matrix norms will be used. $\|v\|_0$, $\|v\|_1$,
$\|v\|_2$ and $\|v\|_\infty$ denote the $\ell_0$-norm (number of
nonzero entries), $\ell_1$-norm (sum of the absolute value of each
entry), $\ell_2$-norm and $\ell_\infty$-norm of a vector $v$.
$\|A\|_1$, $\|A\|_F$, $\|A\|_{2,1}$, $\|A\|_\infty$, and $\|A\|_*$
denote the $\ell_1$-norm ($\sum_{i,j} |A_{ij}|$), Frobenius norm,
$\ell_{2,1}$-norm ($\sum_j \|A_j\|_2$), $\ell_\infty$-norm
($\max_{i,j} |A_{ij}|$), and nuclear norm (the sum of all the singular
values) of a matrix $A$, respectively.

1.2. Related work 

There has been a large body of research on subspace segmentation. Most
recently, the Sparse Subspace Clustering (SSC), Low-Rank
Representation (LRR), and Least Squares Regression (LSR) techniques
have been proposed for subspace segmentation and have attracted much
attention. These methods learn an affinity matrix whose entries
measure the similarities among the data points and then perform
spectral clustering on the affinity matrix to segment the data.
Ideally, the affinity matrix should be block diagonal (or block sparse
in vector form), with nonzero entries corresponding to data point
pairs from the same subspace. A typical choice for the measure of
similarity between $x_i$ and $x_j$ is
$W_{ij} = \exp(-\|x_i - x_j\|/\sigma)$, where $\sigma > 0$. However,
such a method is unable to utilize the underlying linear subspace
structure of the data. The constructed affinity matrix is usually not
block diagonal even under certain strong assumptions, e.g. independent
subspaces (see footnote 1). For a new point $y \in \mathbb{R}^d$ in
the subspaces, SSC pursues a sparse representation:

$$\min_w \|w\|_1 \quad \text{s.t.} \quad y = Xw. \tag{1}$$

Problem (1) can be extended to handle data with noise, which leads to
the popular Lasso [22] formulation:

$$\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1, \tag{2}$$

where $\lambda > 0$ is a parameter. SSC solves problem (1) or (2) for
each data point $y$ in the dataset with all the other data points as
the dictionary. Then it uses the derived representation coefficients
to measure the similarities between data points and constructs the
affinity matrix. It is shown that, if the subspaces are independent,
the sparse representation is block sparse. However, if the data from
the same subspace are highly correlated or clustered, the
$\ell_1$-minimization will generally select a single representative at
random, and ignore other correlated data. This leads to a sparse
solution but misses data correlation information. Thus SSC may result
in a sparse affinity matrix but lead to unsatisfactory performance.

1 A collection of $k$ linear subspaces $\{S_i\}_{i=1}^k$ is
independent if and only if $S_i \cap \sum_{j \neq i} S_j = \{0\}$ for
all $i$ (or $\sum_{i=1}^k S_i = \oplus_{i=1}^k S_i$).
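As an illustration of the Lasso formulation (2), the following is a minimal sketch of a proximal gradient (ISTA) solver for a single data point. It is not the solver used by SSC; the function names `soft_threshold` and `lasso_ista`, the iteration count and the toy data are assumptions made for this example only.

```python
import numpy as np


def soft_threshold(v, tau):
    """Entrywise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize ||y - Xw||_2^2 + lam * ||w||_1 by ISTA."""
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w


# Toy usage: y is built from the first two dictionary columns only.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 12))
y = X[:, 0] - 0.5 * X[:, 1]
w = lasso_ista(X, y, lam=0.1)
print(np.round(w, 3))
```

A larger `lam` drives more coefficients to exactly zero, which is the subset selection behavior discussed above.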

Low-Rank Representation (LRR) is a method which aims to group the
correlated data together. It solves the following convex optimization
problem:

$$\min_W \|W\|_* \quad \text{s.t.} \quad X = XW. \tag{3}$$

The above problem can be extended for the noisy case: 

$$\min_{W,E} \|W\|_* + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad X = XW + E, \tag{4}$$

where $\lambda > 0$ is a parameter. Although LRR is guaranteed to
produce a block diagonal solution when the data are noise free and
drawn from independent subspaces, real data are usually contaminated
with noise or outliers. So the solution to problem (4) is usually very
dense and far from block diagonal. The reason is that nuclear norm
minimization lacks the ability of subset selection. Thus, LRR
generally groups correlated data together, but sparsity cannot be
achieved.

In the context of statistics, Ridge regression
($\ell_2$-regularization) may behave similarly to LRR. Below is the
most recent work, which uses Least Squares Regression (LSR) for
subspace segmentation:

$$\min_W \|X - XW\|_F^2 + \lambda \|W\|_F^2. \tag{5}$$
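Problem (5) is an unconstrained ridge regression, so setting its gradient to zero gives the linear system $(X^T X + \lambda I) W = X^T X$. The sketch below solves that system with NumPy; the function name `lsr_coefficients` and the toy data are assumptions for illustration.

```python
import numpy as np


def lsr_coefficients(X, lam):
    """Closed-form minimizer of ||X - XW||_F^2 + lam * ||W||_F^2."""
    n = X.shape[1]
    G = X.T @ X  # Gram matrix X^T X, of size n x n
    # Zero gradient condition: (G + lam * I) W = G.
    return np.linalg.solve(G + lam * np.eye(n), G)


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
W = lsr_coefficients(X, lam=0.1)
print(np.round(W, 2))
```

The resulting matrix is typically dense, which matches the observation that LSR groups data but does not perform subset selection.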

Both LRR and LSR encourage the grouping effect but lack sparsity. In
fact, for subspace segmentation, both sparsity and the grouping effect
are very important. Ideally, the affinity matrix should be sparse,
with no connection between clusters. On the other hand, the affinity
matrix should not be too sparse, i.e., the nonzero connections within
each cluster should be sufficient for grouping correlated data in the
same subspace. Thus, it is expected that the model can automatically
group the correlated data within each cluster (like LRR and LSR) and
eliminate the connections between clusters (like SSC). Trace Lasso,
defined as $\|X\mathrm{Diag}(w)\|_*$, is such a newly established
regularizer which interpolates between the $\ell_1$-norm and
$\ell_2$-norm of $w$. It is adaptive and depends on the correlation
among the samples in $X$, which can be encoded by $X^T X$. In
particular, when the data are highly correlated ($X^T X$ is close to
$\mathbf{1}\mathbf{1}^T$), it will be close to the $\ell_2$-norm,
while when the data are almost uncorrelated ($X^T X$ is close to $I$),
it will behave like the $\ell_1$-norm. We take the adaptive advantage
of trace Lasso to regularize the representation coefficient matrix,
define an affinity matrix from the coefficients, and apply spectral
clustering (via the normalized Laplacian) to it. Such a model is
called Correlation Adaptive Subspace Segmentation (CASS) in this work.
CASS can be regarded as a method which adaptively interpolates between
SSC and LSR. An intuitive comparison of the coefficient matrices
derived by these four methods can be found in Figure 1. For CASS, we
can see that most large representation coefficients cluster on the
data points from the same subspace as $y$. In comparison, the
connections within each cluster are very sparse by SSC, and the
connections between clusters are very dense by LRR and LSR.

1.3. Contributions 

We summarize the contributions of this paper as follows: 

• We propose a new subspace segmentation method, called the
Correlation Adaptive Subspace Segmentation (CASS), by using trace
Lasso. CASS is the first method that takes the data correlation
into account for subspace segmentation, so it is self-adaptive to
different types of data.

• In theory, we show that if the data are from independent
subspaces, and the objective function satisfies the proposed
Enforced Block Sparse (EBS) conditions, then the obtained solution
is block sparse. Trace Lasso is a special case which satisfies the
EBS conditions.

• We theoretically prove that trace Lasso has the grouping effect,
i.e., the coefficients of a group of correlated data are
approximately equal.

2. Correlation Adaptive Subspace Segmentation by Trace Lasso

Trace Lasso is a recently proposed norm which balances the
$\ell_1$-norm and $\ell_2$-norm. It is formally defined as

$$\Omega(w) = \|X\mathrm{Diag}(w)\|_*.$$

A main difference between trace Lasso and the existing norms is that
trace Lasso involves the data matrix $X$, which makes it adaptive to
the correlation of data. Actually, it only depends on the matrix
$X^T X$ of data, which encodes the correlation information among data.
In particular, if the norm of each column of $X$ is normalized to one,
we have the following decomposition of $X\mathrm{Diag}(w)$:

$$X\mathrm{Diag}(w) = \sum_{i=1}^n |w_i| \, (\mathrm{sign}(w_i) x_i) \, e_i^T.$$

If the data are uncorrelated (the data points are orthogonal,
$X^T X = I$), the above equation gives the singular value
decomposition of $X\mathrm{Diag}(w)$. In this case, trace Lasso is
equal to the $\ell_1$-norm:

$$\|X\mathrm{Diag}(w)\|_* = \|\mathrm{Diag}(w)\|_* = \sum_{i=1}^n |w_i| = \|w\|_1.$$


If the data are highly correlated (the data points are all the same,
$X = x_1 \mathbf{1}^T$, $X^T X = \mathbf{1}\mathbf{1}^T$), trace Lasso
is equal to the $\ell_2$-norm:

$$\|X\mathrm{Diag}(w)\|_* = \|x_1 w^T\|_* = \|x_1\|_2 \|w\|_2 = \|w\|_2.$$

For other cases, trace Lasso interpolates between the $\ell_2$-norm
and $\ell_1$-norm:

$$\|w\|_2 \le \|X\mathrm{Diag}(w)\|_* \le \|w\|_1.$$
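The two extreme cases and the interpolation bound can be checked numerically. The sketch below assumes unit-norm columns, as in the decomposition above; the helper name `trace_lasso` and the random test data are assumptions for illustration.

```python
import numpy as np


def trace_lasso(X, w):
    """Nuclear norm of X @ Diag(w); X * w scales column i of X by w[i]."""
    return np.linalg.norm(X * w, ord="nuc")


rng = np.random.default_rng(0)
n = 6
w = rng.standard_normal(n)

# Uncorrelated data: orthonormal columns, X^T X = I.
X_orth = np.eye(n)

# Fully correlated data: all columns equal to one unit vector x1.
x1 = rng.standard_normal(n)
x1 /= np.linalg.norm(x1)
X_same = np.outer(x1, np.ones(n))

# Generic data with unit-norm columns.
X_rand = rng.standard_normal((n, n))
X_rand /= np.linalg.norm(X_rand, axis=0)

l1, l2 = np.abs(w).sum(), np.linalg.norm(w)
print("l1 norm:", l1, " trace Lasso (orthogonal):", trace_lasso(X_orth, w))
print("l2 norm:", l2, " trace Lasso (identical): ", trace_lasso(X_same, w))
print("trace Lasso (generic):", trace_lasso(X_rand, w))
```

For the generic matrix the printed value should fall between the $\ell_2$-norm and the $\ell_1$-norm of $w$.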

We use trace Lasso for subset selection from all the data adaptively,
which leads to the Correlation Adaptive Subspace Segmentation (CASS)
method. We first consider the subspace segmentation problem with clean
data by CASS and then extend it to the noisy case.

2.1. CASS with clean data 

Let $X = [x_1, \cdots, x_n] = [X_1, \cdots, X_k]\Gamma$ be a set of
data drawn from $k$ subspaces $\{S_i\}_{i=1}^k$, where $X_i$ denotes a
collection of $n_i$ data points from the $i$-th subspace $S_i$,
$n = \sum_{i=1}^k n_i$, and $\Gamma$ is a hypothesized permutation
matrix which rearranges the data to the true segmentation of data. A
given data point $y \in S_i$ can be represented as a linear
combination of all the data points $X$. Unlike SSC, LRR and LSR, CASS
uses the trace Lasso as the objective function and solves the
following problem:

$$\min_{w \in \mathbb{R}^n} \|X\mathrm{Diag}(w)\|_* \quad \text{s.t.} \quad y = Xw. \tag{6}$$

The works on SSC, LRR and LSR show that if the data are sufficiently
sampled from independent subspaces, a block diagonal solution can be
achieved. Previous work further shows that it is easy to get a block
diagonal solution if the objective function satisfies the Enforced
Block Diagonal (EBD) conditions. But the EBD conditions cannot be
applied to trace Lasso directly, since trace Lasso is a function
involving both the data $X$ and $w$. Here we extend the EBD conditions
to the Enforced Block Sparse (EBS) conditions and show that the
obtained solution is block sparse when the objective function
satisfies the EBS conditions. Trace Lasso is a special case which
satisfies the EBS conditions and thus leads to a block sparse
solution.

Enforced Block Sparse (EBS) Conditions. Assume $f$ is a function with
regard to a matrix $X \in \mathbb{R}^{d \times n}$ and a vector
$w = [w_a; w_b; w_c] \in \mathbb{R}^n$, $w \neq 0$. Let
$w^B = [0; w_b; 0] \in \mathbb{R}^n$. The EBS conditions are:

(1) $f(X, w) = f(XP, P^{-1}w)$, for any permutation matrix
$P \in \mathbb{R}^{n \times n}$;

(2) $f(X, w) \ge f(X, w^B)$, and the equality holds if and only if
$w = w^B$.
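As a numerical sanity check (not a proof), the sketch below evaluates both EBS conditions for the trace Lasso objective $f(X, w) = \|X\mathrm{Diag}(w)\|_*$ on random data. The block sizes, the random seed and the helper name `f` are assumptions for illustration.

```python
import numpy as np


def f(X, w):
    """Trace Lasso objective ||X Diag(w)||_*."""
    return np.linalg.norm(X * w, ord="nuc")


rng = np.random.default_rng(1)
d, n = 5, 7
X = rng.standard_normal((d, n))
w = rng.standard_normal(n)

# Condition (1): invariance under a joint permutation of columns and entries.
perm = rng.permutation(n)
P = np.eye(n)[:, perm]      # permutation matrix, so P^{-1} = P^T
lhs = f(X @ P, P.T @ w)

# Condition (2): zeroing the blocks w_a and w_c does not increase f.
w_B = w.copy()
w_B[:2] = 0.0               # w_a (first block, size chosen arbitrarily)
w_B[5:] = 0.0               # w_c (last block, size chosen arbitrarily)

print("f(XP, P^-1 w) =", lhs, " f(X, w) =", f(X, w))
print("f(X, w_B)     =", f(X, w_B))
```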


In some cases, the EBS conditions can be regarded as extensions of the
EBD conditions (see footnote 2). The EBS conditions enforce the
solution to the following problem

$$\min_w f(X, w) \quad \text{s.t.} \quad y = Xw, \tag{7}$$

to be block sparse when the subspaces are independent.

Theorem 1 Let $X = [x_1, \cdots, x_n] = [X_1, \cdots, X_k]\Gamma \in
\mathbb{R}^{d \times n}$ be a data matrix whose column vectors are
sufficiently (see footnote 3) drawn from a union of $k$ independent
subspaces $\{S_i\}_{i=1}^k$, $x_j \neq 0$, $j = 1, \cdots, n$. For
each $i$, $X_i \in \mathbb{R}^{d \times n_i}$ and
$n = \sum_{i=1}^k n_i$. Let $y \in \mathbb{R}^d$ be a new point in
$S_i$. Then the solution to problem (7),
$w^* = \Gamma^{-1}[z_1^*; \cdots; z_k^*] \in \mathbb{R}^n$, is block
sparse, i.e., $z_i^* \neq 0$ and $z_j^* = 0$ for all $j \neq i$.

Proof For $y \in S_i$, let $w^* = \Gamma^{-1}[z_1^*; \cdots; z_k^*]$
be the optimal solution to problem (7), where
$z_i^* \in \mathbb{R}^{n_i}$ corresponds to $X_i$ for each
$i = 1, \cdots, k$. We decompose $w^*$ into two parts
$w^* = u^* + v^*$, where
$u^* = \Gamma^{-1}[0; \cdots; z_i^*; \cdots; 0]$ and
$v^* = \Gamma^{-1}[z_1^*; \cdots; 0; \cdots; z_k^*]$. We have

$$y = Xw^* = Xu^* + Xv^* = X_i z_i^* + \sum_{j \neq i} X_j z_j^*.$$

Since $y \in S_i$ and $X_i z_i^* \in S_i$, $y - X_i z_i^* \in S_i$.
Thus $\sum_{j \neq i} X_j z_j^* = y - X_i z_i^* \in S_i \cap
\oplus_{j \neq i} S_j$. Considering that the subspaces
$\{S_i\}_{i=1}^k$ are independent,
$S_i \cap \oplus_{j \neq i} S_j = \{0\}$, we have
$y = X_i z_i^* = Xu^*$ and $X_j z_j^* = 0$, $j \neq i$. So $u^*$ is
feasible to problem (7). On the other hand, by the definition of $u^*$
and the EBS condition (2), we have

$$f(X, w^*) \ge f(X, u^*).$$

Noticing that $w^*$ is optimal to problem (7),
$f(X, w^*) \le f(X, u^*)$. Thus the equality holds. By the EBS
condition (2), we get $w^* = u^*$. Therefore, $z_i^* \neq 0$, and
$z_j^* = 0$ for all $j \neq i$. ∎

The EBS conditions greatly extend the family of objective functions
which possess the block sparse property. It is easy to check that
trace Lasso satisfies the EBS conditions. Let
$f(X, w) = \|X\mathrm{Diag}(w)\|_*$. For any permutation matrix
$P \in \mathbb{R}^{n \times n}$,

$$f(XP, P^{-1}w) = \|XP\,\mathrm{Diag}(P^{-1}w)\|_*
= \|XPP^{-1}\mathrm{Diag}(w)P\|_*
= \|X\mathrm{Diag}(w)\|_* = f(X, w).$$

Trace Lasso also satisfies the EBS condition (2) by the following
lemma:

Lemma 1 (Lemma 11 of the cited reference) Let
$A \in \mathbb{R}^{d \times n}$ be partitioned in the form
$A = [A_1, A_2]$. Then $\|A\|_* \ge \|A_1\|_*$ and the equality holds
if and only if $A_2 = 0$.

In a similar way, CASS owns the block sparse property:

Theorem 2 Let $X = [x_1, \cdots, x_n] = [X_1, \cdots, X_k]\Gamma \in
\mathbb{R}^{d \times n}$ be a data matrix whose column vectors are
sufficiently drawn from a union of $k$ independent subspaces
$\{S_i\}_{i=1}^k$, $x_j \neq 0$, $j = 1, \cdots, n$. For each $i$,
$X_i \in \mathbb{R}^{d \times n_i}$ and $n = \sum_{i=1}^k n_i$. Let
$y$ be a new point in $S_i$. It holds that the solution to problem
(6), $w^* = \Gamma^{-1}[z_1^*; \cdots; z_k^*] \in \mathbb{R}^n$, is
block sparse, i.e., $z_i^* \neq 0$ and $z_j^* = 0$ for all
$j \neq i$. Furthermore, $z_i^*$ is also optimal to the following
problem:

$$\min_{z_i \in \mathbb{R}^{n_i}} \|X_i\mathrm{Diag}(z_i)\|_* \quad \text{s.t.} \quad y = X_i z_i. \tag{8}$$

The block sparse property of CASS is the same as those of SSC, LRR and
LSR when the data are from independent subspaces. This is also the
motivation for using trace Lasso for subspace segmentation. For the
noisy case, different from the previous methods, CASS may also lead to
a solution which is close to block sparse, and it also has the
grouping effect (see Section 2.3).

2 For example, $f(X, w) = \|w\|_p + 0 \times \|X\|_F = \|w\|_p = g(w)$,
where $p > 0$. It is easy to see that $f(X, w)$ satisfies the EBS
conditions and $g(w)$ satisfies the EBD conditions.

3 That the data sampling is sufficient makes sure that problem (7) has
a feasible solution.

2.2. CASS with noisy data

The noise free and independent subspaces assumption may be violated in
real applications. Problem (6) can be extended to handle noises of
different types. For small magnitude and dense noises (e.g. Gaussian),
a reasonable strategy is to use the $\ell_2$-norm to model the noises:

$$\min_w \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|X\mathrm{Diag}(w)\|_*. \tag{9}$$

Here $\lambda > 0$ is a parameter balancing the effects of the two
terms. For data with a small fraction of gross corruptions, the
$\ell_1$-norm is a better choice:

$$\min_w \|y - Xw\|_1 + \lambda \|X\mathrm{Diag}(w)\|_*. \tag{10}$$

Namely, the choice of the norm depends on the noises. It is important
for subspace segmentation but not the main focus of this paper.

In the case of data contaminated with noises, it is difficult to
obtain a block sparse solution. Though the representation coefficient
derived by SSC tends to be sparse, it is unable to group correlated
data together. On the other hand, LRR and LSR lead to dense
representations which lack the ability of subset selection. CASS, by
using trace Lasso, takes the correlation of data into account, which
provides a tradeoff between sparsity and the grouping effect. Thus it
can be regarded as a method which balances SSC and LSR.

For SSC, LRR, LSR and CASS, each data point is expressed as a linear
combination of all the data with a coefficient vector. These
coefficient vectors can be arranged as a matrix measuring the
similarities between data points. Figure 2 illustrates the coefficient
matrices derived by these four methods on the Extended Yale B database
(see Section 3.1 for the detailed experimental setting). We can see
that the coefficient matrix derived by SSC is so sparse that it is
even difficult to identify how many groups there are. This phenomenon
confirms that SSC loses the data correlation information. Thus SSC
does not perform well for data with strong correlation. On the
contrary, the coefficient matrices derived by LRR and LSR are very
dense. They group many data points together, but do not do subset
selection. There are many nonzero connections between clusters, and
some are very large. Thus LRR and LSR may contain much erroneous
information. Our proposed method CASS, by using trace Lasso, achieves
a more accurate coefficient matrix, which is close to block diagonal,
and it also groups data within each cluster. This comparison suggests
that CASS reveals the true data structure more accurately for subspace
segmentation.

Figure 2. The affinity matrices derived by (a) SSC, (b) LRR, (c) LSR,
and (d) CASS on the Extended Yale B Database (10 subjects).

2.3. The grouping effect 

It has been shown in previous work that the effectiveness of LSR by
$\ell_2$-regularization comes from the grouping effect, i.e., the
coefficients of a group of correlated data are approximately equal. In
this work, we show that trace Lasso also has the grouping effect for
correlated data.

Theorem 3 Given a data vector $y \in \mathbb{R}^d$, data points
$X = [x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$ and parameter
$\lambda > 0$. Let $w^* = [w_1^*, \cdots, w_n^*]^T \in \mathbb{R}^n$
be the optimal solution to problem (9). If $x_i \to x_j$, then
$w_i^* \to w_j^*$.

The proof of Theorem 3 can be found in the supplementary materials.

If each column of $X$ is normalized, $x_i = x_j$ implies that the
sample correlation $r = x_i^T x_j = 1$. Namely $x_i$ and $x_j$ are
highly correlated. Then these two data points will be grouped together
by CASS due to the grouping effect. Illustrations of the grouping
effect are shown in Figures 1 and 2. One can see that the connections
within each cluster by CASS are dense, similar to LRR and LSR. The
grouping effect of CASS may be weaker than that of LRR and LSR, since
it also encourages sparsity between clusters, but it is sufficient for
grouping correlated data together.

Algorithm 1 Solving Problem (9) by ADM
Input: data matrix $X$, parameter $\lambda$.
Initialize: $w^0$, $Y^0$, $\mu^0$, $\rho$, $\mu_{\max}$,
$\varepsilon$, $t = 0$.
Output: coefficient $w^*$.
while not converged do

1. fix the others and update $J$ by
$J^{t+1} = \arg\min_J \frac{\lambda}{\mu^t}\|J\|_* +
\frac{1}{2}\|J - (X\mathrm{Diag}(w^t) - \frac{1}{\mu^t}Y^t)\|_F^2$.

2. fix the others and update $w$ by
$w^{t+1} = A(X^T y + \mathrm{diag}(X^T(Y^t + \mu^t J^{t+1})))$,
where $A = (X^T X + \mu^t \mathrm{Diag}(\mathrm{diag}(X^T X)))^{-1}$.

3. update the multiplier
$Y^{t+1} = Y^t + \mu^t(J^{t+1} - X\mathrm{Diag}(w^{t+1}))$.

4. update the parameter by $\mu^{t+1} = \min(\rho\mu^t, \mu_{\max})$.

5. check the convergence conditions
$\|J^{t+1} - J^t\|_\infty \le \varepsilon$,
$\|w^{t+1} - w^t\|_\infty \le \varepsilon$,
$\|J^{t+1} - X\mathrm{Diag}(w^{t+1})\|_\infty \le \varepsilon$.

6. $t = t + 1$.

end while

2.4. Optimization 

Performing CASS requires solving the convex optimization problem (9),
which can be optimized by off-the-shelf solvers. Previous work
introduces an iteratively reweighted least squares method for solving
problem (9), but the solution is not necessarily globally optimal
because an extra term is added to avoid the non-invertibility issue.
Motivated by the optimization method used in low-rank minimization, we
adopt the Alternating Direction Method (ADM) to solve problem (9). We
first convert it to the following equivalent problem:

$$\min_{J,w} \frac{1}{2}\|y - Xw\|_2^2 + \lambda\|J\|_* \quad \text{s.t.} \quad J = X\mathrm{Diag}(w). \tag{11}$$

This problem can be solved by the ADM method, which operates on the
following augmented Lagrangian function:

$$L(J, w) = \frac{1}{2}\|y - Xw\|_2^2 + \lambda\|J\|_*
+ \mathrm{tr}(Y^T(J - X\mathrm{Diag}(w)))
+ \frac{\mu}{2}\|J - X\mathrm{Diag}(w)\|_F^2, \tag{12}$$

where $Y \in \mathbb{R}^{d \times n}$ is the Lagrange multiplier and
$\mu > 0$ is the penalty parameter for violation of the linear
constraint.





Algorithm 2 Correlation Adaptive Subspace Segmentation
Input: data matrix $X$, number of subspaces $k$

1. Solve problem (9) for each data point in $X$ to obtain the
coefficient matrix $W^*$, where $X$ in (9) should be replaced by
$X_{-i} = [x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n]$.

2. Construct the affinity matrix by $(|W^*| + |W^{*T}|)/2$.

3. Segment the data into $k$ groups by Normalized Cuts.


We can see that $L(J, w)$ is separable, thus it can be decomposed into
two subproblems and minimized with regard to $J$ and $w$,
respectively. The whole procedure for solving problem (9) is outlined
in Algorithm 1. It iteratively solves two subproblems which have
closed form solutions. By the theory of ADM and the convexity of
problem (9), Algorithm 1 converges globally.
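The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the $J$-update is computed by singular value thresholding, and the default values of `mu`, `rho`, `mu_max`, `eps` and `max_iter`, the zero initialization, the function names and the toy data are all choices made for this example.

```python
import numpy as np


def svt(M, tau):
    """Singular value thresholding: argmin_J tau*||J||_* + 0.5*||J - M||_F^2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def cass_adm(X, y, lam, mu=0.1, rho=1.1, mu_max=1e10, eps=1e-6, max_iter=1000):
    """Sketch of Algorithm 1 for 0.5*||y - Xw||_2^2 + lam*||X Diag(w)||_*."""
    d, n = X.shape
    w, J, Y = np.zeros(n), np.zeros((d, n)), np.zeros((d, n))
    G = X.T @ X
    D = np.diag(np.diag(G))          # Diag(diag(X^T X))
    Xty = X.T @ y
    for _ in range(max_iter):
        # Step 1: J-update by singular value thresholding.
        J_new = svt(X * w - Y / mu, lam / mu)
        # Step 2: w-update; the einsum computes diag(X^T (Y + mu * J)).
        rhs = Xty + np.einsum("ij,ij->j", X, Y + mu * J_new)
        w_new = np.linalg.solve(G + mu * D, rhs)
        # Step 3: multiplier update with the constraint residual R.
        R = J_new - X * w_new
        Y = Y + mu * R
        # Step 5: convergence test using the maximum absolute entry.
        change = max(np.abs(J_new - J).max(), np.abs(w_new - w).max(),
                     np.abs(R).max())
        J, w = J_new, w_new
        # Step 4: penalty parameter update.
        mu = min(rho * mu, mu_max)
        if change < eps:
            break
    return w


# Toy usage: y is a combination of the first three unit-norm columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 15))
X /= np.linalg.norm(X, axis=0)
y = X[:, :3] @ np.array([1.0, -0.5, 0.8])
w = cass_adm(X, y, lam=0.01)
print(np.round(w, 3))
```

When the columns of $X$ are orthonormal, trace Lasso reduces to the $\ell_1$-norm, so the output should match soft-thresholding of $X^T y$ by $\lambda$; this gives a convenient correctness check for the sketch.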

2.5. The segmentation algorithm 

To solve the subspace segmentation problem by trace Lasso, we first
solve problem (9) for each data point $x_i$ with
$X_{-i} = [x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n]$, which
excludes $x_i$ itself, and obtain the corresponding coefficients. Then
these coefficients can be arranged as a matrix $W^*$. The affinity
matrix is defined as $(|W^*| + |W^{*T}|)/2$. Finally, we use the
Normalized Cuts (NCuts) [20] to segment the data into $k$ groups. The
whole procedure of the CASS algorithm is outlined in Algorithm 2.
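The first two steps of the segmentation procedure (per-point coefficients with the point itself excluded, then a symmetrized affinity) can be sketched as below. To keep the example short and runnable, a ridge regression solver is used as a stand-in for the trace Lasso solver; the names `coefficient_matrix`, `affinity_matrix`, `ridge_point` and the toy data are hypothetical. The resulting affinity would then be passed to a Normalized Cuts routine, which is not shown.

```python
import numpy as np


def coefficient_matrix(X, solve_point):
    """Column i holds the coefficients of x_i over X_{-i} (zero at entry i)."""
    n = X.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)   # column indices of X_{-i}
        W[others, i] = solve_point(X[:, others], X[:, i])
    return W


def affinity_matrix(W):
    """Symmetric affinity (|W| + |W^T|) / 2."""
    return (np.abs(W) + np.abs(W.T)) / 2.0


def ridge_point(D, y, lam=0.1):
    """Hypothetical stand-in solver (ridge regression), NOT trace Lasso."""
    return np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)


# Toy data: three points in span(e1, e2) and three points in span(e3, e4).
rng = np.random.default_rng(0)
X = np.zeros((4, 6))
X[:2, :3] = rng.standard_normal((2, 3))
X[2:, 3:] = rng.standard_normal((2, 3))
A = affinity_matrix(coefficient_matrix(X, ridge_point))
print(np.round(A, 2))
```

Any per-point solver with the same signature, such as an ADM solver for problem (9), could be passed in place of `ridge_point`.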

3. Experiments 

In this section, we apply CASS for subspace segmentation on three
databases: the Hopkins 155 motion database (see footnote 4), the
Extended Yale B database [6] and the MNIST database of handwritten
digits (see footnote 5). CASS is compared with SSC, LRR and LSR, which
are representative state-of-the-art methods for subspace segmentation.
The affinity matrices derived by all algorithms are also evaluated on
a semi-supervised learning task on the Extended Yale B database. For
fair comparison with previous works, we follow their experimental
settings. The parameters for each method are tuned to achieve the best
performance. The segmentation accuracy/error is used to evaluate the
subspace segmentation performance. The accuracy is calculated by the
best matching rate between the predicted labels and the ground truth
of the data.
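One common way to compute a best matching rate is to find the one-to-one assignment between predicted clusters and ground-truth classes that maximizes the number of agreeing points. The sketch below does this with the Hungarian algorithm from SciPy; it is an assumption about how such an accuracy can be computed, not necessarily the exact evaluation code used here, and the function name `clustering_accuracy` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of points correctly labeled under the best cluster-to-class matching."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    true_ids, t = np.unique(true_labels, return_inverse=True)
    pred_ids, p = np.unique(pred_labels, return_inverse=True)
    # counts[a, b] = number of points in predicted cluster a with true class b.
    counts = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    np.add.at(counts, (p, t), 1)
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(true_labels)


# Cluster names differ from class names, but the partition is identical.
print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0]))
```

The segmentation error reported in the tables is then one minus this accuracy, expressed as a percentage.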

3.1. Data sets and experimental settings 

Hopkins 155 motion database contains 156 sequences, each of which has
39 to 550 data points drawn from two or three motions (a motion
corresponds to a subspace). Each sequence is a sole data set and so
there are 156 subspace segmentation problems in total. We first use
PCA to project the data into a 12-dimensional subspace. All the
algorithms are performed on each sequence, and the maximum, mean and
standard deviation of the error rates are reported.

4 http://www.vision.jhu.edu/data/hopkins155/

5 http://yann.lecun.com/exdb/mnist/

Table 1. The segmentation errors (%) on the Hopkins 155 database.

Comparison under the same setting
        kNN     SSC     LRR     LSR     CASS
MAX     45.59   39.53   36.36   36.36   32.85
MEAN    13.44    4.02    3.23    2.50    2.42
STD     12.90   10.04    6.60    5.62    5.84

Comparison to state-of-the-art methods
        SSC     LRR     LatLRR  CASS
MEAN    2.18    1.71    0.85    1.47

Table 2. The segmentation accuracies (%) on the Extended Yale B
database.

              kNN     SSC     LRR     LSR     CASS
5 subjects    56.88   80.31   86.56   92.19   94.03
8 subjects    52.34   62.90   78.91   80.66   91.41
10 subjects   50.94   52.19   65.00   73.59   81.88

Extended Yale B is challenging for subspace segmentation due to large
noises. It consists of 2,414 frontal face images of 38 subjects under
various lighting, poses and illumination conditions. Each subject has
64 faces. We construct three subspace segmentation tasks based on the
face images of the first 5, 8 and 10 subjects of this database. The
data are first projected into 5×6-, 8×6-, and 10×6-dimensional
subspaces by PCA, respectively. Then the algorithms are employed on
these three tasks and the accuracies are reported.

To further evaluate the effectiveness of CASS for other learning
problems, we also use the derived affinity matrix for semi-supervised
learning. The Markov random walks algorithm is employed in this
experiment. It performs a t-step Markov random walk on the graph or
affinity matrix. The influence of one example on another is
proportional to the affinity between them. We test on the 10-subject
face classification problem. For each subject, 4, 8, 16 and 32 face
images are randomly selected to form the training data set, and the
remaining images are used for testing. Our goal is to predict the
labels of the test data by Markov random walks on the affinity
matrices learnt by kNN, SSC, LRR, LSR and CASS. For kNN, we
experimentally select k = 6 neighbors. The experiment is repeated 20
times, and the accuracy and standard deviation are reported for
evaluation.

MNIST database of handwritten digits is also widely used in subspace
learning and clustering. It has 10 subjects, corresponding to the 10
handwritten digits 0 to 9. We select a subset with a similar size as
in the above face clustering problem for this experiment, which
consists of the first 50 samples of each subject. The accuracies of
SSC, LRR, LSR and CASS are reported.

Figure 3. Comparison of classification accuracy (%) and standard
deviation of different semi-supervised learning based on different
affinity matrices on the Extended Yale B (10 subjects) database.

3.2. Experimental results 

Table 1 tabulates the motion segmentation errors of the compared
methods on the Hopkins 155 database. It shows that CASS gets a mean
misclassification error of 2.42% over all 156 sequences, while the
best previously reported result is 2.50% by LSR. The improvement of
CASS on this database is limited for several reasons. First, previous
methods have performed very well on the data with only slight
corruptions, and thus the room for improvement is limited. Second, the
reported error is the mean of 156 segmentation errors, most of which
are zeros. So even if there are large improvements on some challenging
sequences, the improvement of the mean error is also limited. Third,
the correlation of data is strong as the dimension of each affine
subspace is no more than three, thus CASS tends to be close to LSR in
this case. Due to the dimensionality reduction by PCA and sufficient
data sampling in each motion, CASS may behave like LSR with a strong
grouping effect. Furthermore, in order to compare with the
state-of-the-art methods, we follow the post-processing in [12], which
may not be optimal for CASS, and the error of CASS is reduced to
1.47%. But the best performance, by Latent LRR [14], is 0.85%, which
is much better than the other methods. That is because Latent LRR
further employs unobserved hidden data as the dictionary and has
complex pre-processing and post-processing with several parameters.
The idea of incorporating unobserved hidden data may also be
considered in CASS. This will be our future work.


Figure 4. The affinity matrices derived by (a) SSC, (b) LRR, (c) LSR,
and (d) CASS on the MNIST database.

Table 3. The segmentation accuracies (%) on the MNIST database.

        kNN     SSC     LRR     LSR     CASS
ACC.    61.00   62.60   66.80   68.00   73.80


Table 2 shows the clustering results on the Extended Yale B database.
We can see that CASS outperforms SSC, LRR and LSR on all three
clustering tasks. In particular, CASS gets accuracies of 94.03%,
91.41%, and 81.88% for face clustering with 5, 8, and 10 subjects,
respectively, which outperforms the state-of-the-art method LSR. For
the 5-subject face clustering problem, all four methods perform well,
and no big improvement is made by CASS. But for the 8-subject and
10-subject face clustering problems, CASS achieves significant
improvements. For these two clustering tasks, both LRR and LSR perform
much better than SSC, which can be attributed to the strong grouping
effect of the two methods. However, both methods lack the ability of
subset selection, and therefore may group together some data points
from different clusters. CASS not only preserves the grouping effect
within each cluster but also enhances the sparsity between clusters.
The intuitive comparison of these four methods can be found in Figure
2. It confirms that CASS usually leads to an approximately block
diagonal affinity matrix, which results in a more accurate
segmentation result. This phenomenon is also consistent with the
analysis in Theorems 2 and 3.

For semi-supervised learning, the comparison of the classification
accuracies with different numbers of training data is shown in Figure
3. CASS achieves the best performance and the accuracies on these
settings are all above 90%. Notice that they are much higher than the
clustering accuracies in Table 2. This is mainly due to the mechanism
of semi-supervised learning, which makes use of both labeled and
unlabeled data for training. Accurate graph construction is the key
step for semi-supervised learning. This example shows that the
affinity matrix by trace Lasso is also effective for semi-supervised
learning.

Table 3 shows the clustering accuracies by SSC, LRR, LSR, and CASS on
the MNIST database. The comparison of the affinity matrices derived by
these four methods is illustrated in Figure 4. We can see that CASS
obtains an affinity matrix which is close to block diagonal by
preserving the grouping effect. None of these four methods performs
perfectly on this database. Nonetheless, our proposed CASS method
achieves the best accuracy, 73.80%. The main reason for the imperfect
results may lie in the fact that the handwritten digit data do not fit
the subspace structure well. This is also the main challenge for
real-world applications of subspace segmentation.

4. Conclusions and Future Work 

In this work, we propose the Correlation Adaptive Subspace
Segmentation (CASS) method by using the trace Lasso. Compared with the
existing SSC, LRR, and LSR, CASS simultaneously encourages the
grouping effect and sparsity. The adaptive advantage of CASS comes
from the mechanism of trace Lasso, which balances between the
$\ell_1$-norm and the $\ell_2$-norm. In theory, we show that CASS is
able to reveal the true segmentation result when the subspaces are
independent. The grouping effect of trace Lasso is first established
in this work. Finally, the experimental results on the Hopkins 155,
Extended Yale B, and MNIST databases show the effectiveness of CASS.
Similar improvement can also be observed in the semi-supervised
learning setting on the Extended Yale B database. However, there still
remain many problems for future exploration. First, the data itself,
which may be noisy, are used as the dictionary for linear
reconstruction. It may be better to learn a compact and discriminative
dictionary for trace Lasso. Second, trace Lasso may have many other
applications, e.g. classification, dimensionality reduction, and
semi-supervised learning. Third, more scalable optimization algorithms
should be developed for large scale subspace segmentation.